Taking an airplane is one of the most important and efficient ways to travel. However, many travelers have experienced delayed flight. Which airline carriers delayed most often? Which airports have highest probability to make you wait for a long time? Also, which day of week and which month of year are better for your journey without severe delay?
The aim of this presentation is to provide statistics accompanied by charts and discussions to better understand the dataset.
For this project, we will be exploring the airline on-time performance from a publicly available dataset from the stat-computing website (see reference at the end of this document).
The data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008.
The dataset is packed in a yearly chunks from 1987 to 2008. This is a large dataset: there are nearly 120 million records in total, and takes up 1.6 gigabytes of space compressed and 12 gigabytes when uncompressed.
To make sure we will be not overwhelmed by the size of the data, we will create a subset file that will be used for our Exploratory Data Analysis task.
For a deeper exaplanation about the data preparation and the SQL code, please refer to this page where every steps are explained in details.
## [1] 800000 29
There are 800,000 flights from 1997 till 2008 with 32 features.
## 'data.frame': 800000 obs. of 29 variables:
## $ Year : int 1997 1994 1990 1993 1996 2006 1993 2008 2003 2007 ...
## $ Month : int 8 6 11 3 2 1 8 12 1 11 ...
## $ DayofMonth : int 18 16 3 29 1 9 9 24 5 6 ...
## $ DayOfWeek : int 1 4 6 1 4 1 1 3 7 2 ...
## $ DepTime : int 1038 2049 1855 810 1245 1537 730 904 1012 1804 ...
## $ CRSDepTime : int 940 1900 1855 809 1245 1545 730 845 1004 1740 ...
## $ ArrTime : int 1236 2243 2012 939 1441 1630 954 1058 1643 2132 ...
## $ CRSArrTime : int 1130 2035 2020 943 1425 1632 950 1034 1624 2110 ...
## $ UniqueCarrier : Factor w/ 29 levels "9E","AA","AQ",..: 27 6 26 2 8 28 4 4 14 2 ...
## $ FlightNum : int 1094 489 359 1412 827 2599 504 401 81 177 ...
## $ TailNum : Factor w/ 12932 levels "","-N037M","-N047M",..: 7632 NA NA NA 11664 1327 NA 7698 8089 4594 ...
## $ ActualElapsedTime: int 118 114 77 89 116 53 144 114 271 388 ...
## $ CRSElapsedTime : int 110 95 85 94 100 47 140 109 260 390 ...
## $ AirTime : int 108 NA NA NA 91 26 NA 88 240 345 ...
## $ ArrDelay : int 66 128 -8 -4 16 -2 4 24 19 22 ...
## $ DepDelay : int 58 109 0 1 0 -8 0 19 8 24 ...
## $ Origin : Factor w/ 333 levels "ABE","ABI","ABQ",..: 223 92 294 86 244 148 236 287 240 165 ...
## $ Dest : Factor w/ 335 levels "ABE","ABI","ABQ",..: 284 104 285 32 21 176 174 238 51 285 ...
## $ Distance : int 671 487 372 597 526 127 834 569 1999 2586 ...
## $ TaxiIn : int 4 NA NA NA 14 5 NA 17 7 5 ...
## $ TaxiOut : int 6 NA NA NA 11 22 NA 9 24 38 ...
## $ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CancellationCode : Factor w/ 5 levels "","A","B","C",..: NA NA NA NA NA 1 NA 1 NA 1 ...
## $ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CarrierDelay : int NA NA NA NA NA 0 NA 7 NA 22 ...
## $ WeatherDelay : int NA NA NA NA NA 0 NA 0 NA 0 ...
## $ NASDelay : int NA NA NA NA NA 0 NA 5 NA 0 ...
## $ SecurityDelay : int NA NA NA NA NA 0 NA 0 NA 0 ...
## $ LateAircraftDelay: int NA NA NA NA NA 0 NA 12 NA 0 ...
The data comes originally from RITA where it is described in detail.
Here is a summary of these features:
## Description Name
## 1 1987-2008 Year
## 2 1-12 Month
## 3 1-31 DayofMonth
## 4 1 (Monday) - 7 (Sunday) DayOfWeek
## 5 actual departure time (local, hhmm) DepTime
## 6 scheduled departure time (local, hhmm) CRSDepTime
## 7 actual arrival time (local, hhmm) ArrTime
## 8 scheduled arrival time (local, hhmm) CRSArrTime
## 9 unique carrier code UniqueCarrier
## 10 flight number FlightNum
## 11 plane tail number TailNum
## 12 in minutes ActualElapsedTime
## 13 in minutes CRSElapsedTime
## 14 in minutes AirTime
## 15 arrival delay, in minutes ArrDelay
## 16 departure delay, in minutes DepDelay
## 17 origin (IATA airport code) Origin
## 18 destination (IATA airport code) Dest
## 19 in miles Distance
## 20 taxi in time, in minutes TaxiIn
## 21 taxi out time in minutes TaxiOut
## 22 was the flight cancelled? Cancelled
## 23 reason for cancellation CancellationCode
## 24 1 = yes, 0 = no Diverted
## 25 in minutes CarrierDelay
## 26 in minutes WeatherDelay
## 27 in minutes NASDelay
## 28 in minutes SecurityDelay
## 29 in minutes LateAircraftDelay
Two features has been built on top of the existing to facilitate the exploration exercice.
First, the feature speed has been created, the calculation behind this variable is very simple, we take the distance and divide by the AirTime (flight time) variable (in hours). This give us the speed in miles / hour.
Next, the feature date has also been created to simply reflect the departure date in a date format. We use this format because the data set only offer the data as three separate features (Year, Month, DayOfMonth)
Note:
Reason for cancellation can take 4 possible value (A = carrier, B = weather, C = NAS, D = security).
We have 29 carriers in the dataset.
We have 335 destination airports and 333 origin airports.
The delays in the dataset is measured in minutes, we will often convert this unit in hours based on the analysis we want to perform.
It is effective to devide the numeric variable into the ordered, reasonable range for our analysis. So we split the Distance, ArrDelay, Airtime, CRSDepTime, CRSArrTime respectively and make sure that all the observation is included in given range.
The split of the data is made based on the observation of the output from the summary command. This commande provide the five-number summary, a concise summary of the distribution of the observations. With that we have an information about the location (from the median), the spread (from the quartiles) and the range (from the sample minimum and maximum) of the observations.
1/ Distance of flights in miles
summary(rita$Distance)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 11.0 306.0 544.0 701.4 936.0 4983.0 1263
The data distribution is left (positive) skewed. We can see that the mean is larger than the median.
Most flights are coverining medium distance. The number of long and short distance are comparable.
2/ Arrival delays in minutes
summary(rita$ArrDelay)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1162.000 -7.000 0.000 7.064 11.000 1541.000 16745
Besides the NA values, most of the flights land on-time. We have 16745 NA’s in the dataset for this variable.
3/ Flight time
summary(rita$AirTime)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1422.0 54.0 84.0 102.8 133.0 1603.0 254380
Most flight are short in time (less than an hour) or intermediate (from one to 2 hours); long flight (more than 3 hours) are less frequent.
4/ Scheduled departure time (CRS is the Computer Reservation System)
summary(rita$CRSDepTime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 930 1327 1335 1729 2400
Most of the flights are scheduled during the morning or afternoon and less frequently the evening.
5/ Scheduled arrival time (CRS is the Computer Reservation System)
summary(rita$CRSArrTime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1115 1520 1491 1912 2400
Most flights land the afternoon and the evening. This make sence based on the previous histogram.
The chart show major US airlines (in number of flights), we can see major companies such American Airlines, Delta, United.
For reference, we provide a glossary of abbreviations. Also note that America West Airlines inc. and US Airways Inc. has been merged in October 2007.
| UniqueCarrier | Description |
|---|---|
| 9E | Pinnacle Airlines Inc. |
| AA | American Airlines Inc. |
| AQ | Aloha Airlines Inc. |
| AS | Alaska Airlines Inc. |
| B6 | JetBlue Airways |
| CO | Continental Air Lines Inc. |
| DH | Independence Air |
| DL | Delta Air Lines Inc. |
| EA | Eastern Air Lines Inc. |
| EV | Atlantic Southeast Airlines |
| F9 | Frontier Airlines Inc. |
| FL | AirTran Airways Corporation |
| HA | Hawaiian Airlines Inc. |
| HP | America West Airlines Inc. (Merged with US Airways 9/05. Stopped reporting 10/07.) |
| ML (1) | Midway Airlines Inc. (1) |
| MQ | American Eagle Airlines Inc. |
| NW | Northwest Airlines Inc. |
| OH | Comair Inc. |
| OO | Skywest Airlines Inc. |
| PA (1) | Pan American World Airways (1) |
| PI | Piedmont Aviation Inc. |
| PS | Pacific Southwest Airlines |
| TW | Trans World Airways LLC |
| TZ | ATA Airlines d/b/a ATA |
| UA | United Air Lines Inc. |
| US | US Airways Inc. (Merged with America West 9/05. Reporting for both starting 10/07.) |
| WN | Southwest Airlines Co. |
| XE | Expressjet Airlines Inc. |
| YV | Mesa Airlines Inc. |
Let’s show the number of flight (flight volume) we have per year.
We see from this plot a continuity in the growth of number of flight. There are less flight in the 80th than in the 2008. This is algin with a logical thinking and also the size of the files used to load the database. We removed the year 1987 because we only have the 3 last months of that year.
With that we can assume the random selection of the sample has been made correctly and we are not privileging a year over another.
We see a drop in the number of fligh between 2001 and 2003, probably due to the event of September 11, 2001, however, flight volume is on the increase - dramatically so since 2000.
Let’s now focus the analysis on the delays. The next four plots are showing on the left the number (up) and the density (down) of departure delay and on the right the same for arrival delay.
We see that most flight have no delays. The distribution for the departure seems a bit negatively skewed compare to arrival where the distribution is normal.
The first quartile is at the 25th percentile, we see a clear drop in this quartile and an increase of the upper quantile (75%). This means that we have less and less of the data smaller than the first quartile so a substential increase of the delay year after year. This is true for both departure and arrival but certainly more pronounced for departure.
In this dataset, we have almost all flight not diverted
Also, not many flight cancelled.
We have access to the cancellation code, let’s see the proportion of cancellation code.
Let’s look at the number of flights per destination.
There are way too many destinations. In this case, the best thing to do is subset to the top 10 destinations.
Since we’ve already got the top destinations in a table format with the number of plane, it’s a good opportunity to display a visualization of these two variables.
No susprise, the airports with the highest number of plane are those having the highest number of flights.
We will show the distribution of the distance for each flight.
The histogram of distance is left skewed so I’m going to transform the data using a log transform.
I log-transformed the left skewed distance distribution. However this transformation does not provide any added value or sence to the data.
Due to my computer limits, for the next operation I only selected 100 flights.
From this plot we see that we don’t have a lot of obvious correlation between features however, let’s look at some of them.
For large scale commercial airplane, speed in flight is 500 MPH.
Let’s use 600 MPH as a limit.
As speed increase, the distance increase. This makes a lot of sence.
The relationship between speed and distance appears to be exponential rather than linear.
With a basic linear model, we have an \(R^2\) value of .4549 therefore, speed explains about 45 percent of the variance in distance. The fitted regression line is : Distance = -1221.7 + 4.922 speed. According to this model, average distance increases by 4.922 miles for each additional mile per hour of speed.
The residuals are the vertical distances from the observed distance to the line.
The residual plot (see billow) has some unisually large residuals labeled; observations 222924 for example. We also observe that residuals are closer to zero at low distances; the variance of the residuals is not constant across all speed, but increasing with speed.
If we try to predict the distance for a speed of 300 mph, we got a flight distance of 255 miles.
For inference, we requires some asumptions about the distribution of the error term. We assume that the random errors are independant and indentically distributed as Normal random variables.
Residual plots help us to assess the fit of the model and the assumptions for the error.
Here we are requested two plots: a plot of residuals vs fit and a QQ plot to check for normality of residuals.
In the residual plot a curve has been added. This curve is a fitted lowess (local polynomial regression) curve, called a smoother. The residuals are assummed independant and identivally distributed, but there is a pattern evident. The residuals have a “U” shape or bowl shape. This pattern could indicate that there is a variable missing from the model. In the QQ plot, normally distributed residuals should lie approximately along the reference line shown in the plot. We clearly see that this is not really the case. Therefore, our model regression model is not really a good fit for predicting distance of flight for a given speed.
Let’s try an exponential regression.
We are testing 3 different kind of models respectively * m2 : Distance ~ speed + I(speed^2) * m3 : Distance ~ speed + I(speed^2) + I(speed^3) * m4 : Distance ~ I(speed^2)
With the model m3 we have an \(R^2\) of 52.7% which is much better.
Also the residual plot has a line shape but still the QQplot normally distributed residuals does not lie along the reference line shown in the plot.
#summary(m3)
plot(m3, which=1:2)
Finally, let’s look at the speed vs distance with less overplotting.
From this last plot we start to see something interesting, we see several points which are disconnected from the main dots. This probably show different kind of plane. We start to see 3 exponentionals curves.
The red line depict the average over the distance values for each speed value.
The plot suggest that the number of flight are similar with respect to variance, but with possibly different centers (different location). The Saterday and Friday are farther apart that the Tuesday and Wednesday for example, but it is not easy to tell from the plot if these differences are significant. Let us compare the means and standard deviations of the 7 groups.
## rita.per_day$DayOfWeek: Monday
## mean sd
## 106.20649 18.60073
## --------------------------------------------------------
## rita.per_day$DayOfWeek: Tuesday
## mean sd
## 105.23805 17.74649
## --------------------------------------------------------
## rita.per_day$DayOfWeek: Wednesday
## mean sd
## 105.94680 18.56581
## --------------------------------------------------------
## rita.per_day$DayOfWeek: Thursday
## mean sd
## 105.42020 19.47367
## --------------------------------------------------------
## rita.per_day$DayOfWeek: Friday
## mean sd
## 105.98287 19.50268
## --------------------------------------------------------
## rita.per_day$DayOfWeek: Saterday
## mean sd
## 92.85663 15.50580
## --------------------------------------------------------
## rita.per_day$DayOfWeek: Sunday
## mean sd
## 99.71957 18.48723
The standard deviations are close for the seven groups, which we observed from the boxplots. However, the means of the groups are also close;
Are there “significant” differences among the means ?
We can answer this question by conducting a quick One-way ANOVA test. In this case, the null hypothesis is that the mean of number of flight is equal for the seven days of the weeks.
The R oneway.test function is a simple way to obtain the one-way analysis of variance. It is valid if certain conditions are satisfied: the errors are \(NID(0,\sigma^2)\), where “NID” is an abbreviation for “normally distributed and independent.” The symbol \(\sigma^2\) is the error variance, and stating that errors are \(NID(0,\sigma^2)\) also means that the variance of random error is equal for all groups. In the oneway.test function, there is an adjustment made (to degrees of freedom for error) for unequal variance. Normal error distribution can be checked using Normal-QQ plots of residuals.
In the Normal-QQ plot the residuals should lie approximately along the reference line. The plots do not reveal any severe departure from the assumptions for this model.
Let’s now perform the One-way ANOVA test :
##
## One-way analysis of means
##
## data: rita.per_day$count and rita.per_day$DayOfWeek
## F = 83.9766, num df = 6, denom df = 7756, p-value < 2.2e-16
Just looking at the p-value (< 2.2e-16), we can conclude to reject \(H_0\) at 5% significance.
We see Saterday and Sunday are two days with less delay, Saterday is the day with less flights in general. Let’s look more closely.
Best days to travel and avoid delays are Saturdays, Sunday but also Tuesdays or Wednesdays. Fridays are bad for delays. Also note the pick than the drop arround 2001, this is probably representative to the 9/11 event.
Next, we can visualize the number of flight for each months with a separation between the weekend and weekday. We will do this exercice for the year 2008.
Definitively, we see less flights during the weekend however, notice July and August (and also at the end of June) where this tendency seems to be opposite, the number of flights is higher the weekend versus the weekday.
We can also show the best time in the day for traveling with a separation between weekday from weekend. For this exercice we will select the year 2008 and the month of October. This month has been selected because it has no known public holliday that may impact the visulization.
This plot is showing the fluctuations of the number of flights during a day, in particular the month of October 2008. The lines are connecting the means respectively for the weekday and the weekends.
During the weekday, we clearly see that we have more flights at the begining of the day (6:00 and 7:00 AM) than it remain stable till 3:00 PM where we have a small drop. Later arrount 6:00 PM, we again have a peak of flight but not as high as in the morning.
During the weekend, the peaks happen arround 9:00 AM and at 1:00 PM
We have almost no flight from midnight till 5:00 AM.
It also seems that we have more outlier data during the after-noon especially arround 2 pm.
This tendencies are repeating accross other months of the same year (not shown here)
Next, let’s visualize the relationship between taxi-in and the origin. Taxi indicates the movement of an aircraft on the ground. Here we will analyse the taxiing immediately after the landing (taxi-in).
For most of the flights, the taxiing time spent after landing is between 3 and 5 minutes. Let’s compute the relative frequencies and display the distribution of this value.
From this last plot, we see 4 groups of flights emerging, each at 1 minutes interval. Most of the flights are in the group of 4 and 5 minutes.
The objects of interest are the outliers, In this case we will only choose destinations with “average delay” higher than 15 minutes.
This plot is a bit messy but show some interesting facts. Some destination airports (WRD, ASE) have very low average delay despit the number of fligh they accumulate but on the other hand, some airports (RDR) have very high average delay with very few flights.
We will analyse, only delays higher than 10 minutes.
As the flight time increase the arrival delay is increasing.
The next last plot also show the density of flights (on a log scale). As the time increase, we have less flights but interestingly, the delay increase significantly.
Let’s visualize the delay over the months and see which months are better
In a table format, we can compute the delay per carrier for each months and visualize these data. We will visualize the carrier with “average delay” higher than 15 minutes.
This plot is not the best to display a trend because the coloring stacking is hidding things however we see that some airport are better in delay at the begining of the year and other at the end of the year. So we start to see a seasonal effect in the delays.
When the time is a continuous predictor it is sometimes interesting to use the logit transofrmation.
If values are naturally restricted to be in the range 0 to 1, not including the end-points, then a logit transformation may be appropriate.
A logit is the defined as the logarithm of the odds. If \(p\) is the probability of an event, then \((1 – p)\) is the probability of not observing the event, and the odds of the event are \(\frac{p}{(1 – p)}\). Hence, the logit is
\[ logit(p) = log \frac{p}{(1-p)} \]
The logit is undefined when \(p = 0\) or \(p = 1.0\). This is not a problem with because the logit transformation is applied to a predicted probability which can be shown to always be greater than 0 and less than 1.0.
Here we visualize the logit transformation applied to the mean of delay (higher than 10 minutes) for each days of the dataset.
This last plot is showing the logit transofrmation applied for the arrival delay.
The black line show the average probablibilites per month.
We clearly see peaks and drops repeating year after year.
Let’s visualize three consecutive years if we can see some seasonaibility effect.
We see the drop in probabilities of arrival delays during the summer holliday (July, August) as well as at the end of the year (December, January). This make sense, more people are traveling during summer and during the year-end celebrations.
Finally, we will use the logit function we’ve explored in the previous section to visually show the arrvival delays in a calendar view.
We see similarly the same seasonality effect observed in the previous plot. December, January, July and August are the months where the probability to have highest delay is the maximum. We olso notice the very darks cells in September 2001 reflecting the W.T.C. attacks.
From this last drawing, we recognize the seasonal effect discussed already.
This plot show a map of the average arrival delay per airport in US. The airport for which we see the dots in puprle are those having an average arrival delay of at least 20 minutes. We notice that these airports are on the east of the map. The airport with the code RDR (Grand Forks AFB) is having the highest average delay of 93 minutes (we consider this airport as an outlier and therefore it has been removed from the drawing).
The reasons why these airports are showing highest delay is not clear. One reason could be the impact of the weather conditions. After some investigation, the tornados for example are more likely to happen on the south & east part of the USA.
An other reason could be the highest number of flight deserving the east part of the US.
For this plot we will show the delay on a hourly arrival delay with various statistics.
This plot is showing the increase of arrival delays from multiple angles. The y-axis depicts the ratio between min and max delay. This information is shown as bars for every days and with a red/brun line for each year.
We see that the minimum delay is not really increasing or decreasing from year to year but the maximum delay is on the other hand, increasing regularly. We certainly see the highest increase when we left the 19th.
Still on the bottom, two areas are maked with different colors. The red color show the year of the W.T.C attack and the blue area show the “recovery” year. We can notice the pick in delays after 2001 probably due to a reinforcement of the security measure which has probably caused many delays. Interestingly, 2001 is not the year where we had the highest delay.
Finally, the top part of the graph show the IQR. The IQR is the difference between the upper and lower quartiles. In other words, the IQR is the 1st quartile subtracted from the 3rd. In red we draw the upper quartile (75%) calculated on a yearly basis and in green the lower quartile (25%) also calculated on a yearly basis. This time, the increase is not really clear. We can than conclude that delays are not more or less frequent from year to year but certainly when we have delay, this delay is more important (as shown at the bottom).
For the third plot we will try to visualize the impact of the September 11, 2001 attacks on the flight volume for the busiest airport of USA in that year. For that we will create a new dataset containing information about 5 airports between July and December 2001. For the selection of the airports we will take only the busiest airports by passenger in 2001 (wikipedia link).
These airports are :
This plot show the change in major airports daily flight volume after WTC attacks.
We clearly see the drop on the 9/11. The recovery of the flight volume has been almost immediate once the flight traffic has been re-opened after the event but not at the same level as before the attacks.
We can notice how in Chicago for example, the volume dropped from 1000 to 50 after September 11th and only recovered to 700 once the traffic has been re-established.
The blue area show the recovery area from the 11th to the 19th. Almost a week was necessay to re-established a proper flight traffic.
An other element to notice is the uneven effect visible for each airport. This is the effect of the weekend/seasonality which drop the number of flights.
I was interested in analyzing this particular set of data. First because it is an accessible piece of information which does not require any background domain knowledge in order to run an analysis. Second, the very large volume of data was a very good chalange to me. I spent a lot of time just loading, parsing and creating a usable dataset in R studio.
I struggled a bit at first with exactly what I expected to learn from this dataset. However, I did a lot of quick coding to reveal some interesting information about the data.
The analysis performed reveal some obvious and non obvious trend in the data but many times I was a bit confused about how to interpret the findings. There are several variables not included as an explanatory or response variable in the dataset but can affect the interpretation of relationships between variables (lurking variable). For example, the aircraft maintenance, the staffing, …
All these features may impact the delay of a flight.
The various regression model I tried to implement was not really conclusive, especially when I try to fit a linear model to each day, predicting delay from time of day. The nature of the data did not reveal obvious correlation.
I was pleased with how easy it is to produce a simple map of the world in R. And plotting points and lines was easy and straight forward.
It really helped to just systematically go through each variable with some quick and dirty plots and see what the data showed me. There is still much to be learned from this dataset, and more questions to find and answer. For example :
This has really helped me flesh out my varied interests in data analytics. I look forward to analyzing more intriguing datasets in the future, and hope you found some of the insights into this dataset as interesting as I did.